Goto

Collaborating Authors

 high dimensional data


Diffusion Curvature for Estimating Local Curvature in High Dimensional Data

Neural Information Processing Systems

We introduce a new intrinsic measure of local curvature on point-cloud data called diffusion curvature. Our measure uses the framework of diffusion maps, including the data diffusion operator, to structure point cloud data and define local curvature based on the laziness of a random walk starting at a point or region of the data. We show that this laziness directly relates to volume comparison results from Riemannian geometry. We then extend this scalar curvature notion to an entire quadratic form using neural network estimations based on the diffusion map of point-cloud data. We show applications of both estimations on toy data, single-cell data, and on estimating local Hessian matrices of neural network loss landscapes.



Into the Void: Mapping the Unseen Gaps in High Dimensional Data

Zhang, Xinyu, Estro, Tyler, Kuenning, Geoff, Zadok, Erez, Mueller, Klaus

arXiv.org Artificial Intelligence

We present a comprehensive pipeline, augmented by a visual analytics system named ``GapMiner'', that is aimed at exploring and exploiting untapped opportunities within the empty areas of high-dimensional datasets. Our approach begins with an initial dataset and then uses a novel Empty Space Search Algorithm (ESA) to identify the center points of these uncharted voids, which are regarded as reservoirs containing potentially valuable novel configurations. Initially, this process is guided by user interactions facilitated by GapMiner. GapMiner visualizes the Empty Space Configurations (ESC) identified by the search within the context of the data, enabling domain experts to explore and adjust ESCs using a linked parallel-coordinate display. These interactions enhance the dataset and contribute to the iterative training of a connected deep neural network (DNN). As the DNN trains, it gradually assumes the task of identifying high-potential ESCs, diminishing the need for direct user involvement. Ultimately, once the DNN achieves adequate accuracy, it autonomously guides the exploration of optimal configurations by predicting performance and refining configurations, using a combination of gradient ascent and improved empty-space searches. Domain users were actively engaged throughout the development of our system. Our findings demonstrate that our methodology consistently produces substantially superior novel configurations compared to conventional randomization-based methods. We illustrate the effectiveness of our method through several case studies addressing various objectives, including parameter optimization, adversarial learning, and reinforcement learning.


Diffusion Curvature for Estimating Local Curvature in High Dimensional Data

Neural Information Processing Systems

We introduce a new intrinsic measure of local curvature on point-cloud data called diffusion curvature. Our measure uses the framework of diffusion maps, including the data diffusion operator, to structure point cloud data and define local curvature based on the laziness of a random walk starting at a point or region of the data. We show that this laziness directly relates to volume comparison results from Riemannian geometry. We then extend this scalar curvature notion to an entire quadratic form using neural network estimations based on the diffusion map of point-cloud data. We show applications of both estimations on toy data, single-cell data, and on estimating local Hessian matrices of neural network loss landscapes.


Sparse Modelling for Feature Learning in High Dimensional Data

Neelam, Harish, Veerella, Koushik Sai, Biswas, Souradip

arXiv.org Artificial Intelligence

This paper presents an innovative approach to dimensionality reduction and feature extraction in high-dimensional datasets, with a specific application focus on wood surface defect detection. The proposed framework integrates sparse modeling techniques, particularly Lasso and proximal gradient methods, into a comprehensive pipeline for efficient and interpretable feature selection. Leveraging pre-trained models such as VGG19 and incorporating anomaly detection methods like Isolation Forest and Local Outlier Factor, our methodology addresses the challenge of extracting meaningful features from complex datasets. Evaluation metrics such as accuracy and F1 score, alongside visualizations, are employed to assess the performance of the sparse modeling techniques. Through this work, we aim to advance the understanding and application of sparse modeling in machine learning, particularly in the context of wood surface defect detection.


Formation-Controlled Dimensionality Reduction

Jeong, Taeuk, Jung, Yoon Mo

arXiv.org Artificial Intelligence

Dimensionality reduction represents the process of extracting low dimensional structure from high dimensional data. High dimensional data include multimedia databases, gene expression microarrays, and financial time series, for example. In order to deal with such real-world data properly, it is better to reduce its dimensionality to avoid undesired properties of high dimensions such as the curse of dimensionality [14, 11]. As a result, classification, visualization, and compression of data can be expedited, for example [14]. In many problems, it is presumed that the dimensionality of the measured data is only artificially high; the measured data are high-dimensional but data nearly have a lower-dimensional structure, since they are multiple, indirect measurements of an underlying factors, which typically cannot be directly calibrated [4].


ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning

Neural Information Processing Systems

Independent Components Analysis (ICA) and its variants have been successfully used for unsupervised feature learning. However, standard ICA requires an orthonoramlity constraint to be enforced, which makes it difficult to learn overcomplete features. In addition, ICA is sensitive to whitening. These properties make it challenging to scale ICA to high dimensional data. In this paper, we propose a robust soft reconstruction cost for ICA that allows us to learn highly overcomplete sparse features even on unwhitened data.


Bayesian Probabilistic Co-Subspace Addition

Neural Information Processing Systems

For modeling data matrices, this paper introduces Probabilistic Co-Subspace Addition (PCSA) model by simultaneously capturing the dependent structures among both rows and columns. Briefly, PCSA assumes that each entry of a matrix is generated by the additive combination of the linear mappings of two low-dimensional features, which distribute in the row-wise and column-wise latent subspaces respectively. In consequence, PCSA captures the dependencies among entries intricately, and is able to handle non-Gaussian and heteroscedastic densities. By formulating the posterior updating into the task of solving Sylvester equations, we propose an efficient variational inference algorithm. Furthermore, PCSA is extended to tackling and filling missing values, to adapting model sparseness, and to modelling tensor data. In comparison with several state-of-art methods, experiments demonstrate the effectiveness and efficiency of Bayesian (sparse) PCSA on modeling matrix (tensor) data and filling missing values.


Sample Complexity of Testing the Manifold Hypothesis

Neural Information Processing Systems

The hypothesis that high dimensional data tends to lie in the vicinity of a low dimensional manifold is the basis of a collection of methodologies termed Manifold Learning. In this paper, we study statistical aspects of the question of fitting a manifold with a nearly optimal least squared error. Given upper bounds on the dimension, volume, and curvature, we show that Empirical Risk Minimization can produce a nearly optimal manifold using a number of random samples that is {\it independent} of the ambient dimension of the space in which data lie. We obtain an upper bound on the required number of samples that depends polynomially on the curvature, exponentially on the intrinsic dimension, and linearly on the intrinsic volume. For constant error, we prove a matching minimax lower bound on the sample complexity that shows that this dependence on intrinsic dimension, volume and curvature is unavoidable.


Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data

Neural Information Processing Systems

In this paper we introduce a new underlying probabilistic model for prin- cipal component analysis (PCA). Our formulation interprets PCA as a particular Gaussian process prior on a mapping from a latent space to the observed data-space. We show that if the prior's covariance func- tion constrains the mappings to be linear the model is equivalent to PCA, we then extend the model by considering less restrictive covariance func- tions which allow non-linear mappings. This more general Gaussian pro- cess latent variable model (GPLVM) is then evaluated as an approach to the visualisation of high dimensional data for three different data-sets. Additionally our non-linear algorithm can be further kernelised leading to'twin kernel PCA' in which a mapping between feature spaces occurs.